# WACV-2024-Papers

<table>
    <tr>
        <td><strong>Application</strong></td>
        <td>
            <a href="https://huggingface.co/spaces/DmitryRyumin/NewEraAI-Papers" style="float:left;">
                <img src="https://img.shields.io/badge/🤗-NewEraAI--Papers-FFD21F.svg" alt="App" />
            </a>
        </td>
    </tr>
</table>

<div align="center">
    <a href="https://github.com/DmitryRyumin/WACV-2024-Papers/blob/main/sections/2024/main/generative_models_for_image_video_3d.md">
        <img src="https://cdn.jsdelivr.net/gh/DmitryRyumin/NewEraAI-Papers@main/images/left.svg" width="40" alt="" />
    </a>
    <a href="https://github.com/DmitryRyumin/WACV-2024-Papers/">
        <img src="https://cdn.jsdelivr.net/gh/DmitryRyumin/NewEraAI-Papers@main/images/home.svg" width="40" alt="" />
    </a>
    <a href="https://github.com/DmitryRyumin/WACV-2024-Papers/blob/main/sections/2024/main/visualization.md">
        <img src="https://cdn.jsdelivr.net/gh/DmitryRyumin/NewEraAI-Papers@main/images/right.svg" width="40" alt="" />
    </a>
</div>

## Vision + Language and/or Other Modalities

![Section Papers](https://img.shields.io/badge/Section%20Papers-34-42BA16) ![Preprint Papers](https://img.shields.io/badge/Preprint%20Papers-23-b31b1b) ![Papers with Open Code](https://img.shields.io/badge/Papers%20with%20Open%20Code-19-1D7FBF) ![Papers with Video](https://img.shields.io/badge/Papers%20with%20Video-29-FF0000)

| **Title** | **Repo** | **Paper** | **Video** |
|-----------|:--------:|:---------:|:---------:|
| [Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models](https://openaccess.thecvf.com/content/WACV2024/html/Luo_Zero-Shot_Video_Moment_Retrieval_From_Frozen_Vision-Language_Models_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Luo_Zero-Shot_Video_Moment_Retrieval_From_Frozen_Vision-Language_Models_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2309.00661-b31b1b.svg)](http://arxiv.org/abs/2309.00661) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=lOkj4h4_0Ic) |
| [Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection](https://openaccess.thecvf.com/content/WACV2024/html/Buettner_Investigating_the_Role_of_Attribute_Context_in_Vision-Language_Models_for_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/krbuettner/attributes_and_vlms?style=flat)](https://github.com/krbuettner/attributes_and_vlms) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Buettner_Investigating_the_Role_of_Attribute_Context_in_Vision-Language_Models_for_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2303.10093-b31b1b.svg)](http://arxiv.org/abs/2303.10093) | :heavy_minus_sign: |
| [Benchmarking Out-of-Distribution Detection in Visual Question Answering](https://openaccess.thecvf.com/content/WACV2024/html/Shi_Benchmarking_Out-of-Distribution_Detection_in_Visual_Question_Answering_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/Sxx1995/Benchmarking-Out-of-Distribution-Detection-in-Visual-Question-Answering?style=flat)](https://github.com/Sxx1995/Benchmarking-Out-of-Distribution-Detection-in-Visual-Question-Answering) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Shi_Benchmarking_Out-of-Distribution_Detection_in_Visual_Question_Answering_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=Pj23xLYGt-Y) |
| [Sound3DVDet: 3D Sound Source Detection using Multiview Microphone Array and RGB Images](https://openaccess.thecvf.com/content/WACV2024/html/He_Sound3DVDet_3D_Sound_Source_Detection_Using_Multiview_Microphone_Array_and_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/yuhanghe01/Sound3DVDet?style=flat)](https://github.com/yuhanghe01/Sound3DVDet) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/He_Sound3DVDet_3D_Sound_Source_Detection_Using_Multiview_Microphone_Array_and_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=kyRcuBrEcW0) |
| [LAVSS: Location-Guided Audio-Visual Spatial Audio Separation](https://openaccess.thecvf.com/content/WACV2024/html/Ye_LAVSS_Location-Guided_Audio-Visual_Spatial_Audio_Separation_WACV_2024_paper.html) | [![GitHub Page](https://img.shields.io/badge/GitHub-Page-159957.svg)](https://yyx666660.github.io/LAVSS/) <br /> [![GitHub](https://img.shields.io/github/stars/YYX666660/LAVSS?style=flat)](https://github.com/YYX666660/LAVSS) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Ye_LAVSS_Location-Guided_Audio-Visual_Spatial_Audio_Separation_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2310.20446-b31b1b.svg)](http://arxiv.org/abs/2310.20446) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=ux8aX6vddDw) |
| [Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-based Vision and Language Models](https://openaccess.thecvf.com/content/WACV2024/html/Yi_Augment_the_Pairs_Semantics-Preserving_Image-Caption_Pair_Augmentation_for_Grounding-Based_Vision_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/amzn/augment-the-pairs-wacv2024?style=flat)](https://github.com/amzn/augment-the-pairs-wacv2024) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Yi_Augment_the_Pairs_Semantics-Preserving_Image-Caption_Pair_Augmentation_for_Grounding-Based_Vision_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.02536-b31b1b.svg)](http://arxiv.org/abs/2311.02536) | :heavy_minus_sign: |
| [CLID: Controlled-Length Image Descriptions with Limited Data](https://openaccess.thecvf.com/content/WACV2024/html/Hirsch_CLID_Controlled-Length_Image_Descriptions_With_Limited_Data_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/Eladhi/CLID?style=flat)](https://github.com/Eladhi/CLID) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Hirsch_CLID_Controlled-Length_Image_Descriptions_With_Limited_Data_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2211.14835-b31b1b.svg)](http://arxiv.org/abs/2211.14835) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=ca4jiim9D5g) |
| [STYLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization](https://openaccess.thecvf.com/content/WACV2024/html/Bose_STYLIP_Multi-Scale_Style-Conditioned_Prompt_Learning_for_CLIP-Based_Domain_Generalization_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Bose_STYLIP_Multi-Scale_Style-Conditioned_Prompt_Learning_for_CLIP-Based_Domain_Generalization_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2302.09251-b31b1b.svg)](http://arxiv.org/abs/2302.09251) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=NGQgE1q3wrs) |
| [THInImg: Cross-Modal Steganography for Presenting Talking Heads in Images](https://openaccess.thecvf.com/content/WACV2024/html/Zhao_THInImg_Cross-Modal_Steganography_for_Presenting_Talking_Heads_in_Images_WACV_2024_paper.html) |:heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Zhao_THInImg_Cross-Modal_Steganography_for_Presenting_Talking_Heads_in_Images_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.17177-b31b1b.svg)](http://arxiv.org/abs/2311.17177) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=YAFSDjDsxVM) |
| [Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining](https://openaccess.thecvf.com/content/WACV2024/html/Sahin_Enhancing_Multimodal_Compositional_Reasoning_of_Visual_Language_Models_With_Generative_WACV_2024_paper.html) | [![GitHub Page](https://img.shields.io/badge/GitHub-Page-159957.svg)](https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html) <br /> [![GitHub](https://img.shields.io/github/stars/ugorsahin/Generative-Negative-Mining?style=flat)](https://github.com/ugorsahin/Generative-Negative-Mining) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Sahin_Enhancing_Multimodal_Compositional_Reasoning_of_Visual_Language_Models_With_Generative_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.03964-b31b1b.svg)](http://arxiv.org/abs/2311.03964) | :heavy_minus_sign: |
| [Temporal Context Enhanced Referring Video Object Segmentation](https://openaccess.thecvf.com/content/WACV2024/html/Hu_Temporal_Context_Enhanced_Referring_Video_Object_Segmentation_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/haliphinx/TCE-RVOS?style=flat)](https://github.com/haliphinx/TCE-RVOS) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Hu_Temporal_Context_Enhanced_Referring_Video_Object_Segmentation_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=3YXG4kZiaZA) |
| [Fine-Grained Alignment for Cross-Modal Recipe Retrieval](https://openaccess.thecvf.com/content/WACV2024/html/Wahed_Fine-Grained_Alignment_for_Cross-Modal_Recipe_Retrieval_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/PLAN-Lab/FARM?style=flat)](https://github.com/PLAN-Lab/FARM) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Wahed_Fine-Grained_Alignment_for_Cross-Modal_Recipe_Retrieval_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=1YrmDF6eYto) |
| [Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation](https://openaccess.thecvf.com/content/WACV2024/html/Hu_Learning_To_Adapt_CLIP_for_Few-Shot_Monocular_Depth_Estimation_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Hu_Learning_To_Adapt_CLIP_for_Few-Shot_Monocular_Depth_Estimation_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.01034-b31b1b.svg)](http://arxiv.org/abs/2311.01034) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=EA4M1vgQkFk) |
| [Annotation-Free Audio-Visual Segmentation](https://openaccess.thecvf.com/content/WACV2024/html/Liu_Annotation-Free_Audio-Visual_Segmentation_WACV_2024_paper.html) | [![GitHub Page](https://img.shields.io/badge/GitHub-Page-159957.svg)](https://jinxiang-liu.github.io/anno-free-AVS/) <br /> [![GitHub](https://img.shields.io/github/stars/jinxiang-liu/anno-free-AVS?style=flat)](https://github.com/jinxiang-liu/anno-free-AVS) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Liu_Annotation-Free_Audio-Visual_Segmentation_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2305.11019-b31b1b.svg)](http://arxiv.org/abs/2305.11019) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=-FF_3SDOsaU) |
| [Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing](https://openaccess.thecvf.com/content/WACV2024/html/Xu_Rethink_Cross-Modal_Fusion_in_Weakly-Supervised_Audio-Visual_Video_Parsing_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Xu_Rethink_Cross-Modal_Fusion_in_Weakly-Supervised_Audio-Visual_Video_Parsing_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.08151-b31b1b.svg)](http://arxiv.org/abs/2311.08151) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=Qv2aPoUA_zQ) |
| [SDNet: An Extremely Efficient Portrait Matting Model via Self-Distillation](https://openaccess.thecvf.com/content/WACV2024/html/Li_SDNet_An_Extremely_Efficient_Portrait_Matting_Model_via_Self-Distillation_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Li_SDNet_An_Extremely_Efficient_Portrait_Matting_Model_via_Self-Distillation_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=iZzG04VXUCc) |
| [FELGA: Unsupervised Fragment Embedding for Fine-Grained Cross-Modal Association](https://openaccess.thecvf.com/content/WACV2024/html/Zhuo_FELGA_Unsupervised_Fragment_Embedding_for_Fine-Grained_Cross-Modal_Association_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Zhuo_FELGA_Unsupervised_Fragment_Embedding_for_Fine-Grained_Cross-Modal_Association_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=rs6rRc_o2Ok) |
| [Modality-Aware Representation Learning for Zero-Shot Sketch-based Image Retrieval](https://openaccess.thecvf.com/content/WACV2024/html/Lyou_Modality-Aware_Representation_Learning_for_Zero-Shot_Sketch-Based_Image_Retrieval_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Lyou_Modality-Aware_Representation_Learning_for_Zero-Shot_Sketch-Based_Image_Retrieval_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2401.04860-b31b1b.svg)](http://arxiv.org/abs/2401.04860) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=5KsTetjNgCQ) |
| [Multitask Vision-Language Prompt Tuning](https://openaccess.thecvf.com/content/WACV2024/html/Shen_Multitask_Vision-Language_Prompt_Tuning_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/sIncerass/MVLPT?style=flat)](https://github.com/sIncerass/MVLPT) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Shen_Multitask_Vision-Language_Prompt_Tuning_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2211.11720-b31b1b.svg)](http://arxiv.org/abs/2211.11720) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=f9vjQxFBHU8) |
| [EASUM: Enhancing Affective State Understanding through Joint Sentiment and Emotion Modeling for Multimodal Tasks](https://openaccess.thecvf.com/content/WACV2024/html/Hwang_EASUM_Enhancing_Affective_State_Understanding_Through_Joint_Sentiment_and_Emotion_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Hwang_EASUM_Enhancing_Affective_State_Understanding_Through_Joint_Sentiment_and_Emotion_WACV_2024_paper.pdf) | :heavy_minus_sign: |
| [Complementary-Contradictory Feature Regularization Against Multimodal Overfitting](https://openaccess.thecvf.com/content/WACV2024/html/Tejero-de-Pablos_Complementary-Contradictory_Feature_Regularization_Against_Multimodal_Overfitting_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/CyberAgentAILab/CM-VQVAE?style=flat)](https://github.com/CyberAgentAILab/CM-VQVAE) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Tejero-de-Pablos_Complementary-Contradictory_Feature_Regularization_Against_Multimodal_Overfitting_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=R7BiKXBa0ZY) |
| [FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions](https://openaccess.thecvf.com/content/WACV2024/html/Rotstein_FuseCap_Leveraging_Large_Language_Models_for_Enriched_Fused_Image_Captions_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Rotstein_FuseCap_Leveraging_Large_Language_Models_for_Enriched_Fused_Image_Captions_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2305.17718-b31b1b.svg)](http://arxiv.org/abs/2305.17718) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=xXUCTqmF_q4) |
| [Describe Images in a Boring Way: Towards Cross-Modal Sarcasm Generation](https://openaccess.thecvf.com/content/WACV2024/html/Ruan_Describe_Images_in_a_Boring_Way_Towards_Cross-Modal_Sarcasm_Generation_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/EnablerRx/CMSG-EGRM?style=flat)](https://github.com/EnablerRx/CMSG-EGRM) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Ruan_Describe_Images_in_a_Boring_Way_Towards_Cross-Modal_Sarcasm_Generation_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=1Jcwau6VxVY) |
| [Can CLIP Help Sound Source Localization?](https://openaccess.thecvf.com/content/WACV2024/html/Park_Can_CLIP_Help_Sound_Source_Localization_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/swimmiing/ACL-SSL?style=flat)](https://github.com/swimmiing/ACL-SSL) <br /> [![HF](https://img.shields.io/badge/🤗-demo-FFD21F.svg)](https://huggingface.co/spaces/swimmiing/ACL-SSL-zeroshot-demo) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Park_Can_CLIP_Help_Sound_Source_Localization_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.04066-b31b1b.svg)](http://arxiv.org/abs/2311.04066) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=hgxqz9ww4rU) |
| [Domain Aligned CLIP for Few-Shot Classification](https://openaccess.thecvf.com/content/WACV2024/html/Gondal_Domain_Aligned_CLIP_for_Few-Shot_Classification_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Gondal_Domain_Aligned_CLIP_for_Few-Shot_Classification_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.09191-b31b1b.svg)](http://arxiv.org/abs/2311.09191) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=3sw6UWcSyNI) |
| [SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data](https://openaccess.thecvf.com/content/WACV2024/html/Yang_SCoRD_Subject-Conditional_Relation_Detection_With_Text-Augmented_Data_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/uvavision/SCoRD?style=flat)](https://github.com/uvavision/SCoRD) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Yang_SCoRD_Subject-Conditional_Relation_Detection_With_Text-Augmented_Data_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2308.12910-b31b1b.svg)](http://arxiv.org/abs/2308.12910) | :heavy_minus_sign: |
| [Simple Token-Level Confidence Improves Caption Correctness](https://openaccess.thecvf.com/content/WACV2024/html/Petryk_Simple_Token-Level_Confidence_Improves_Caption_Correctness_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Petryk_Simple_Token-Level_Confidence_Improves_Caption_Correctness_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2305.07021-b31b1b.svg)](http://arxiv.org/abs/2305.07021) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=iDq7V4pMgTo) |
| [Bi-Directional Training for Composed Image Retrieval via Text Prompt Learning](https://openaccess.thecvf.com/content/WACV2024/html/Liu_Bi-Directional_Training_for_Composed_Image_Retrieval_via_Text_Prompt_Learning_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/Cuberick-Orion/Bi-Blip4CIR?style=flat)](https://github.com/Cuberick-Orion/Bi-Blip4CIR) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Liu_Bi-Directional_Training_for_Composed_Image_Retrieval_via_Text_Prompt_Learning_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2303.16604-b31b1b.svg)](http://arxiv.org/abs/2303.16604) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=kCYzsmekweg) |
| [MOPA: Modular Object Navigation with PointGoal Agents](https://openaccess.thecvf.com/content/WACV2024/html/Raychaudhuri_MOPA_Modular_Object_Navigation_With_PointGoal_Agents_WACV_2024_paper.html) | [![GitHub Page](https://img.shields.io/badge/GitHub-Page-159957.svg)](https://3dlg-hcvc.github.io/mopa/) <br /> [![GitHub](https://img.shields.io/github/stars/3dlg-hcvc/mopa?style=flat)](https://github.com/3dlg-hcvc/mopa) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Raychaudhuri_MOPA_Modular_Object_Navigation_With_PointGoal_Agents_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2304.03696-b31b1b.svg)](http://arxiv.org/abs/2304.03696) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=Jcspov0UpsA) |
| [GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning](https://openaccess.thecvf.com/content/WACV2024/html/Xu_GIPCOL_Graph-Injected_Soft_Prompting_for_Compositional_Zero-Shot_Learning_WACV_2024_paper.html) | [![GitHub](https://img.shields.io/github/stars/HLR/GIPCOL?style=flat)](https://github.com/HLR/GIPCOL) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Xu_GIPCOL_Graph-Injected_Soft_Prompting_for_Compositional_Zero-Shot_Learning_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2311.05729-b31b1b.svg)](http://arxiv.org/abs/2311.05729) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=SISClycr5hg) |
| [Text-Guided Face Recognition using Multi-Granularity Cross-Modal Contrastive Learning](https://openaccess.thecvf.com/content/WACV2024/html/Hasan_Text-Guided_Face_Recognition_Using_Multi-Granularity_Cross-Modal_Contrastive_Learning_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Hasan_Text-Guided_Face_Recognition_Using_Multi-Granularity_Cross-Modal_Contrastive_Learning_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2312.09367-b31b1b.svg)](http://arxiv.org/abs/2312.09367) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=ZilneI4BDNU) |
| [Leveraging Task-Specific Pre-Training to Reason Across Images and Videos](https://openaccess.thecvf.com/content/WACV2024/html/Sadhu_Leveraging_Task-Specific_Pre-Training_To_Reason_Across_Images_and_Videos_WACV_2024_paper.html) | :heavy_minus_sign: | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Sadhu_Leveraging_Task-Specific_Pre-Training_To_Reason_Across_Images_and_Videos_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=PZSJhrNnKAo) |
| [VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal Graphs](https://openaccess.thecvf.com/content/WACV2024/html/Abdessaied_VD-GR_Boosting_Visual_Dialog_With_Cascaded_Spatial-Temporal_Multi-Modal_Graphs_WACV_2024_paper.html) | [![WEB Page](https://img.shields.io/badge/WEB-Page-159957.svg)](https://perceptualui.org/publications/abdessaied24_wacv/) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Abdessaied_VD-GR_Boosting_Visual_Dialog_With_Cascaded_Spatial-Temporal_Multi-Modal_Graphs_WACV_2024_paper.pdf) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=8JLE_2lGjjw) |
| [TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval](https://openaccess.thecvf.com/content/WACV2024/html/Ruan_TriCoLo_Trimodal_Contrastive_Loss_for_Text_To_Shape_Retrieval_WACV_2024_paper.html) | [![GitHub Page](https://img.shields.io/badge/GitHub-Page-159957.svg)](https://3dlg-hcvc.github.io/tricolo/) <br /> [![GitHub](https://img.shields.io/github/stars/3dlg-hcvc/tricolo?style=flat)](https://github.com/3dlg-hcvc/tricolo) | [![thecvf](https://img.shields.io/badge/pdf-thecvf-7395C5.svg)](https://openaccess.thecvf.com/content/WACV2024/papers/Ruan_TriCoLo_Trimodal_Contrastive_Loss_for_Text_To_Shape_Retrieval_WACV_2024_paper.pdf) <br /> [![arXiv](https://img.shields.io/badge/arXiv-2201.07366-b31b1b.svg)](http://arxiv.org/abs/2201.07366) | [![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?style=for-the-badge&logo=YouTube&logoColor=white)](https://www.youtube.com/watch?v=4YK65qDUUJs) |
